CAP: Evaluation of Persuasive and Creative Content

Abstract

We address the task of advertisement image generation and introduce three evaluation metrics to assess Creativity, prompt Alignment, and Persuasiveness (CAP) in generated advertisement images. Despite recent advancements in Text-to-Image (T2I) generation and their performance in generating high-quality images for explicit descriptions, evaluating these models remains challenging. Existing evaluation methods focus largely on assessing alignment with explicit, detailed descriptions, but evaluating alignment with visually implicit prompts remains an open problem. Additionally, creativity and persuasiveness are essential qualities that enhance the effectiveness of advertisement images, yet are seldom measured. To address this, we propose three novel metrics for evaluating the creativity, alignment, and persuasiveness of generated images. Our findings reveal that current T2I models struggle with creativity, persuasiveness, and alignment when the input text is an implicit message. We further introduce a simple yet effective approach to enhance T2I models’ capabilities in producing images that are better aligned, more creative, and more persuasive.

Metrics (CAP)

To address the evaluation gap, the authors propose three novel metrics to assess generated images based on their Creativity, Alignment with the prompt, and Persuasiveness.

Creativity (C_obj)

Creativity is defined as the image's uniqueness while still effectively conveying the intended ad message. The metric is calculated as a ratio: the AIM score (for relevance) is divided by the average CLIP similarity between the generated image and the objects explicitly mentioned in the text (to measure uniqueness).
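The ratio described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the AIM score and the per-object CLIP similarities have already been computed elsewhere, and the function name `creativity_score` and the `eps` guard against a zero or negative denominator are assumptions added here.

```python
from statistics import mean


def creativity_score(aim_score, object_similarities, eps=1e-8):
    """Sketch of C_obj = AIM / mean(CLIP similarity to mentioned objects).

    aim_score: relevance of the image to the ad message (computed elsewhere).
    object_similarities: CLIP similarity between the generated image and each
        object explicitly mentioned in the text (computed elsewhere).
    eps: hypothetical lower bound on the denominator, added for this sketch.
    """
    if not object_similarities:
        raise ValueError("at least one object similarity is required")
    avg_similarity = mean(object_similarities)
    # A lower average similarity to the literal objects means a more unique
    # image, so dividing by it rewards uniqueness at equal relevance.
    return aim_score / max(avg_similarity, eps)
```

With the same AIM score, an image that looks less like a literal depiction of the mentioned objects receives a higher score, which matches the intent that an image should be unique while still conveying the message.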

Process of computing C_obj

Alignment of Image and Message (AIM)

This metric is designed to evaluate the alignment between the message and the image. It captures both semantic and visual mismatch, unlike other evaluation metrics that focus only on visual mismatch.

The process involves two steps:

1. A Multimodal Large Language Model (MLLM) generates a detailed description of the generated ad image.
2. A fine-tuned Large Language Model (LLM) uses this description to generate a new action-reason statement (AR_gen) that it interprets from the image.

The final AIM score is the semantic similarity between the original input message (AR_m) and the newly generated statement (AR_gen).
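The two steps and the final comparison can be sketched as a small pipeline. This is only an illustration under stated assumptions: the three callables (`describe_image`, `generate_action_reason`, `embed`) are hypothetical stand-ins for the MLLM, the fine-tuned LLM, and a sentence-embedding model, and cosine similarity over embeddings is assumed here as the semantic similarity measure, since the text above does not name a specific one.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def aim_score(image, ar_m, describe_image, generate_action_reason, embed):
    """Sketch of AIM: compare the input message with what the image conveys.

    image: the generated ad image (any object the callables understand).
    ar_m: the original action-reason message used as the T2I prompt.
    describe_image: hypothetical callable wrapping the MLLM (image -> text).
    generate_action_reason: hypothetical callable wrapping the fine-tuned
        LLM (description -> AR_gen).
    embed: hypothetical callable mapping text to a vector of floats.
    """
    description = describe_image(image)           # step 1: MLLM description
    ar_gen = generate_action_reason(description)  # step 2: interpreted AR_gen
    # Final score: similarity between AR_m and AR_gen (cosine is an assumption).
    return cosine_similarity(embed(ar_m), embed(ar_gen))
```

Passing the models in as callables keeps the sketch independent of any particular model API; real MLLM, LLM, and embedding wrappers would be supplied by the caller.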

Overview of AIM. Orange denotes training, while blue is inference. AR_w and AR_l are used in training as the preferred and dis-preferred statements. AR_m is the prompt for the T2I model.

Persuasiveness (P_comp+AIM)

This metric evaluates an image's ability to be convincing to its intended audience. It combines scores from multiple components grounded in prior persuasion literature. An LLM is prompted with questions to score the image's ability to appeal to a specific audience, convert features to benefits, and use rhetorical appeals (Ethos, Pathos, Logos). It also scores visual qualities like elaboration, originality, imagination, and synthesis. The final persuasiveness score is a weighted average of these component scores and the AIM score.
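The final combination step can be sketched as a weighted average. This is a hedged illustration only: the component scores are assumed to come from the LLM prompts described above, and the function name, the component keys, and the `"aim"` key are hypothetical. The weights themselves are not given in this text, so the caller must supply them.

```python
def persuasiveness_score(component_scores, aim_score, weights):
    """Sketch of P_comp+AIM as a weighted average of components and AIM.

    component_scores: dict mapping a component name (for example "audience",
        "benefits", "ethos", "pathos", "logos", "elaboration", "originality",
        "imagination", "synthesis") to its LLM-assigned score.
    aim_score: the AIM score for the same image.
    weights: dict with one weight per component plus one for "aim";
        the values are placeholders chosen by the caller, not published ones.
    """
    scores = dict(component_scores)
    scores["aim"] = aim_score
    missing = set(scores) - set(weights)
    if missing:
        raise KeyError(f"missing weights for: {sorted(missing)}")
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    # Normalizing by the total weight keeps the result on the same scale
    # as the individual scores.
    return sum(weights[name] * scores[name] for name in scores) / total_weight
```

With equal weights this reduces to the plain mean of the component scores and the AIM score; unequal weights let a caller emphasize, for instance, message alignment over visual originality.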

Process of computing P_comp+AIM persuasiveness score

Qualitative Analysis

Human-created ads (a, d) convey their message through creative and persuasive visual storytelling, blending the implicit message seamlessly into the visuals. The T2I baseline (b, e), on the other hand, depicts relevant entities but without the underlying intent. This highlights the need for better metrics for persuasiveness, creativity, and abstract text-image alignment. Ours (c, f) demonstrates how improving visual storytelling can enhance ad generation by aligning visuals with the intended message effectively. Texts on the right form the prompt to T2I models.

Example of images chosen by each annotator between I_LLM and I_AR. For each pair of images, annotators select the image that better aligns with each AR_m. In each row, the value under each image indicates the score generated by the metric listed for that row. A ✓ represents the chosen image, while a × indicates the rejected image. The green circle highlights agreement with human annotations in choosing the better-aligned image, and the red circle indicates disagreement.

Example of images chosen by human annotators for each component of the P_comp score, compared with the choice made by the P_comp score itself. In each row, a ✓ represents the image chosen for the corresponding component by the human annotators or by the persuasiveness component, while a × indicates the rejected image. Green represents agreement of the score with the human annotators, while red represents disagreement.

Creativity: image (d), which shows a unique but relevant object portrayal, scores best, while random (a) and generic (b) objects score worst. Red/green shows low/high values for C_obj, low/high for AIM, and high/low for Sim. Blue denotes moderate values.


